ABSTRACT

Ribosomal Database Project (RDP; http://rdp.cme.msu.edu/)
provides the research community with
aligned and annotated rRNA gene sequence data, 
along with tools to allow researchers to analyze 
their own rRNA gene sequences in the RDP frame- 
work. RDP data and tools are utilized in fields as 
diverse as human health, microbial ecology, envir- 
onmental microbiology, nucleic acid chemistry, 
taxonomy and phylogenetics. In addition to aligned 
and annotated collections of bacterial and archaeal 
small subunit rRNA genes, RDP now includes a col- 
lection of fungal large subunit rRNA genes. RDP 
tools, including Classifier and Aligner, have been 
updated to work with this new fungal collection. 
The use of high-throughput sequencing to charac- 
terize environmental microbial populations has 
exploded in the past several years, and as 
sequence technologies have improved, the sizes of 
environmental datasets have increased. With 
release 11, RDP is providing an expanded set of 
tools to facilitate analysis of high-throughput data, 
including both single-stranded and paired-end 
reads. In addition, most tools are now available as 
open source packages for download and local use 
by researchers with high-volume needs or who 
would like to develop custom analysis pipelines. 

INTRODUCTION 

Ribosomal Database Project (RDP) 11.1, released in
October 2013 (http://rdp.cme.msu.edu/), contains
2 809 406 aligned and annotated bacterial and archaeal
small subunit (SSU) rRNA gene sequences and 62 860
fungal large subunit (LSU) rRNA gene sequences. The 
majority of rRNA gene sequences in the RDP database 
are incomplete. Most of these are derived from sequencing 
PCR amplification products, whereas a small number of 
older entries derive from reverse transcriptase sequencing 
of isolated rRNA. Because PCR amplification makes use 
of primers to conserved regions internal to the genes, few 
of these sequences cover the 3' and 5' ends of the genes 
(Figure 1). Still, a diverse selection of complete gene
sequences, mostly derived from genome sequencing, is 
available. Only a relatively small percentage of bacterial 
and archaeal sequences originate from organisms in 
culture; roughly 85% and 97%, respectively, of bacterial 
and archaeal sequences in RDP are from DNA directly 
isolated from environmental samples. 

Over the past several years we have been approached by 
a number of researchers interested in using RDP tools for 
analysis of fungi in the environment. With the latest 
release, we are providing both an alignment of fungal 
28S rRNA gene sequences and a fungal training set for 
the RDP Classifier leveraging a recently published phylogenetically
consistent taxonomic mapping (4). For our
new fungal alignment, the number of sequences covering 
positions in the 5' end of the gene is much higher than for 
the 3' end (Figure 1C). The 28S gene is much longer than
the bacterial 16S gene, and many fungal researchers 
appear to find that sequencing 5' regions of the 28S gene 
provides sufficient phylogenetic resolution for strain 
differentiation. 

RDP offers tools for browsing and searching the data 
collections, for taxonomic classification and nearest- 
neighbor search, for primer-probe testing and for tree 
building. In addition, RDPipeline tools are specifically
designed for processing high-volume amplicon sequence
data. New tools have been designed with speed and
capacity in mind, and most previously published tools
have been updated to accommodate recent changes
in sequencing technology. Many RDP tools are also
available as open-source stand-alone packages.

Figure 1. Gene coverage: number of sequences from RDP release 11.1
covering the indicated positions on the reference sequence. (A) Bacterial
SSU rRNA gene. Positions relative to Escherichia coli sequence
GenBank accession J01695.1. Gray bars indicate variable regions (1).
(B) Archaeal SSU rRNA gene. Positions relative to E. coli sequence
GenBank accession J01695.1. (C) Fungal LSU rRNA gene. Positions
relative to S. cerevisiae GenBank accession NC_001144.5 LSU gene.
D1 and D2 indicate hypervariable regions initially used for discrimination
among Fusarium spp. (2). The D2 region is among the most
highly variable eukaryotic LSU regions in terms of both length and
structure (3). Such high diversity may improve the performance of the
RDP Classifier when discriminating between closely related genera.
Gene coverage charts are available online and updated with each
incremental RDP release.



RDP DATA COLLECTIONS 

RDP obtains most of its rRNA sequences from the 
International Nucleotide Sequence Database 
Collaboration (INSDC; 5) databases. To prepare an RDP 
release, data files from the 'standard' dataclass and taxo- 
nomic divisions 'Prokaryotes', 'Fungi' and 'Environmental 
Samples' are downloaded from the European Nucleotide 
Archive (ENA; 6) ftp site. Records are examined for an 
'rRNA' feature key of minimum length 500 bases (to 
allow sufficient context for taxonomic classification). If
such a record is found within an accession not in the RDP
database, or within an existing accession but with a newer 
modification date, the sequence defined by the 'rRNA' 
feature is extracted. These new sequences are then filtered 
using a version of the RDP SeqMatch tool (described 
below) trained on a small hand-curated set of bacterial, 
archaeal, eukaryotic and mitochondrial SSU sequences 
and fungal LSU sequences. Only sequences whose best
match is bacterial, archaeal or fungal, with an S_ab score of
at least 0.3, are saved. The original INSDC annotations,
including structured comments such as Genomic 
Standards Consortium MIxS-compliant comments (7), are 
also captured. Many organism names in the INSDC 
records are not up to date. We obtain the most recent 
validly published synonym from Bacterial Nomenclature
Up-to-Date (http://www.dsmz.de/). 

Each sequence is aligned and classified using the corres- 
ponding RDP Aligner and Classifier (see below). Any 
sequence with a negative alignment bit saved score is discarded.
The sequence is assigned to the lowest taxon in the
RDP taxonomy with classification bootstrap confidence of 
80% or above. Sequences passing this quality filtering are 
then subjected to the following tagging process. Sequences 
from type strains are tagged as 'type'. Any sequence with 
an accession listed in the bioproject.xml file (available from
the NCBI ftp site; 8) is tagged as a genome project sequence.
All 16S rRNA gene sequences are screened for chimeric 
sequences using UCHIME (9) in reference mode. Positive 
UCHIME results are tagged as 'suspect' sequences. Next, 
the NCBI taxonomic assignment (10) is determined using 
the taxonomy ID in the db_xref qualifier obtained from 
the INSDC annotation. Any sequence assigned under an 
NCBI taxon with a name containing 'environmental', 'uncultured'
or 'uncultivated' is tagged as an 'uncultured'
sequence. For each release, a set of flat files containing
the entire sequence collection for each of the three genes
is available for download in aligned or unaligned
FASTA, and annotated GenBank formats. With each
release, RDP provides the resource files for the NCBI 
LinkOut service. This allows researchers to jump directly 
to RDP sequence records from the corresponding records 
in the NCBI's Nucleotide and BioProject databases (11). 
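
The filtering and tagging rules described above can be summarized in a short sketch. This is only an illustration of the stated thresholds and tag rules, not the RDP release code: the `Record` type, its field names and the 'genome' tag label are hypothetical, and the UCHIME result is assumed to have been computed beforehand (for 16S sequences only).

```python
from dataclasses import dataclass, field

MIN_RRNA_LENGTH = 500   # minimum 'rRNA' feature length stated in the text
MIN_SAB_SCORE = 0.3     # minimum SeqMatch S_ab score stated in the text
UNCULTURED_WORDS = ("environmental", "uncultured", "uncultivated")


@dataclass
class Record:
    """Hypothetical container for one candidate sequence (illustration only)."""
    accession: str
    length: int
    sab_score: float
    bits_saved: float
    ncbi_lineage: list            # NCBI taxon names, root to leaf
    is_type_strain: bool = False
    in_bioproject: bool = False   # accession listed in bioproject.xml
    uchime_positive: bool = False # assumed precomputed by UCHIME
    tags: set = field(default_factory=set)


def passes_filters(rec):
    """Apply the length, S_ab and alignment-score filters from the text."""
    return (rec.length >= MIN_RRNA_LENGTH
            and rec.sab_score >= MIN_SAB_SCORE
            and rec.bits_saved >= 0)  # negative scores are discarded


def tag(rec):
    """Attach the tags described in the text and return the tag set."""
    if rec.is_type_strain:
        rec.tags.add("type")
    if rec.in_bioproject:
        rec.tags.add("genome")    # hypothetical label for genome project sequences
    if rec.uchime_positive:
        rec.tags.add("suspect")
    names = " ".join(rec.ncbi_lineage).lower()
    if any(word in names for word in UNCULTURED_WORDS):
        rec.tags.add("uncultured")
    return rec.tags
```

A record is tagged only after it passes the filters; the tags are independent of one another, so a single sequence may carry several.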

Alignments 

The sequences in the RDP database are aligned using
Infernal, a stochastic context-free grammar-based aligner
(12). This aligner has several advantages: it incorporates
secondary-structure information into the alignment
process; as a model-based aligner, it allows new sequences
to be easily added to a pre-existing alignment; and it is fast
enough for very large numbers of sequences. The bacterial and
archaeal aligners were trained using secondary structure
information from the Comparative RNA Web Site (CRW;
13) and training alignments we developed with 2591 bacterial
and 144 archaeal full-length sequences, respectively, mostly from
sequenced genomes. The bacterial training
sequences have greater coverage (27 phyla) than those 
used for RDP release 10 (16 phyla). Many rRNA genes 
in genome sequences are annotated with incorrect start or 
stop positions. We adjusted these to produce consistent 
endpoints for the training set. We optimized the Infernal 
aligner parameters, particularly the relative entropy, to
provide improved handling for partial sequences. Models 
and training sequences are available for download from 
the RDP website. 

The V6 region is especially hard to place into a multiple 
sequence alignment because much of the region is not 
conserved in size, sequence or secondary structure; 
however, the high diversity of the V6 region makes it a 
very common amplification target. Available tools often
do not attempt to produce a multiple sequence alignment 
for amplicons of this region, but instead score pairwise 
alignments to a set of reference sequences for analysis
(14). The tuned Infernal 1.1 aligner is able to correctly
align the less hypervariable positions in the short region
amplified by commonly used V6 primers (15), producing 
an alignment for this region matching that produced with 
full-length sequences (Figure 2). 

We also tested this new alignment model by comparing 
the alignment produced using this model with the CRW 
bacterial seed alignment, which is hand-curated to match 
secondary structure. The Infernal alignment placed 92.7%
of the bases in alignable positions (columns); the remainder
corresponded to positions not conserved in the bacterial
rRNA structure. We found that in 99.3% of the
cases, pairs of residues in an alignable column of the
Infernal alignment were mapped together to a column in 
the CRW alignment. 

The fungal alignment uses a model built with 183 LSU 
sequences from complete fungal genomes plus the CRW 
fungal set, and covers four major fungal phyla: 
Ascomycota, Basidiomycota, Chytridiomycota and 
Blastocladiomycota. To develop the training model, we 
used a combination of the CRW general eukaryotic con- 
servation model and Saccharomyces cerevisiae secondary 
structure model. In the large ribosomal subunit, the 5.8S 
and 28S molecules form a common secondary structure 
and the training model included the combined 5.8S and 
28S gene sequences. (The Internal Transcribed Spacer
ITS2 between 5.8S and 28S evolves too rapidly for
global alignment and is treated as an insert in our
model.) This fungal LSU Aligner is especially useful for
aligning sequences resulting from protocols that amplify
and sequence from all or part of the 5.8S gene to the 5'
portion of the 28S gene, while not compromising alignment
of sequences of only the 28S gene.

Taxonomy 

RDP bases its bacterial and archaeal taxonomies on the 
taxonomic roadmaps published by Bergey's Trust
(http://www.bergeys.org/outlines.html). As these are updated
only at long intervals, we capture changes in taxonomic 
information and the publication of new species from 
the literature and from the List of Prokaryotic Names 
with Standing in Nomenclature website (16). We modify 
this taxonomy by adding clades for groups with few 
cultured relatives based on published informal
taxonomies, where available. We compare this with the 
phylogenetic assessment from the All-Species Living 
Tree Project (17) and to our own assessment using the 
RDP Classifier. When there is a discrepancy between 
these sources, we conduct our own phylogenetic assessment
by creating trees from the aligned sequences,
including sequences from literature clades, and accept
those clades best supported. 

The fungal taxonomy used by RDP is the recently published
taxonomy hand-developed using published
phylogenies for different taxa and taxonomic databases
(4) with updates. Because rRNA genes are too slowly
evolving to reliably separate the validly named species
(18), genus is the lowest rank presented in the RDP 
taxonomy. Where available, both genus and specific 
epithet, along with strain identifiers, are maintained for 
each sequence, but are not used to group or sort se- 
quences. For species where our phylogenetically 
informed taxonomic assessment differs from the formal 
nomenclatural genus portion of the species binomial, the 
phylogenetically incorrect (but valid) name is maintained
with the sequence and will differ from the assigned taxonomic
lineage.

Using the pairwise distances generated with the 
enhanced distance calculation tool included in the 
mcClust package (described below) and the sequences 
and taxonomic data available from the RDP database, 
we computed accumulation curves for the group size 
and intra-taxa distances at the genus, family, order, class 
and phylum level for each domain (Figure 3). 



TOOL DESCRIPTIONS 

The RDP website provides an interactive interface to 
the RDP database and tools. RDP tools that accept 
sequence input from researchers do so via file upload or 
direct entry into a text field on the tool page. Recognized 
sequence file formats include FASTA, FASTQ, GenBank 
and EMBL formats. Most tools display results in a taxo- 
nomic hierarchy view that allows interactive browsing. 
Results are saved in session until the start of a new 
task, allowing researchers to switch between tools 
without losing their results. From most tools, RDP se- 
quences can be selected and saved to a SeqCart, which 
in turn can be used by other RDP tools as input, or downloaded
as aligned or unaligned sequence files, or as a
distance matrix from the download page. Batch loading 
into the SeqCart is possible by uploading a file containing 
a list of INSDC accession numbers or RDP identifiers. 
Most tools are available as open-source command-line 
versions from the RDP GitHub repository
(http://github.com/rdpstaff/).
A

--AAGGCUUGACAUA.UAGGG. . aAAACUGGCAGAGAUGNCAGGuccgc . . aagggCCCUA-UAC

--UGGGUUUGACAUG.CACUG. . gAUCGCCUCAGAGAUGGGGUUuccgu . . aagguCAGUG - UGC 

--CGGGCUUAAAUUGcACCUG. . . AAUAAUGUGGAAACAUGUUAgccg . . . uaaggCAGGUGUGA 

--AGGGUUUGAAAUC.CUAGCugcUCAGCCAUAGAAACAUGGCUuccuucgaggguGCUAG-GAC 

--AAGGCUUGACAUG.GGGUG. . aAAACUCGUAGAGAUACGGGGuccgc . . aagggCGCCU -CAC 

--AAGGCUUGACAUA.CACGA. . gAACGGGCCAGAAAUGGUCAAcucuuuggaccgUCGUG-AAC 

{{{{{{{. . . . ,<<<.<<<<<. . .--<<<<<< < > >>>>>> >>>>>->>>, [[-[ 



B 

--AAGGCUUGA. . . . CAU - . AUAGGg . .aAAACUGGCAGAGAUGNCAGGuccgc aagggcCCUAUA . .C 

--UGGGUUUGA. . . .CAUG.CACUG. . . gAUCGCCUCAGAGAUGGGGUUuccguaaggucagugugC . . 

--CGGGCUUAAauugCACC.U . . . gAAUAAUGUGGAAACAUGUUAg c-CGUAA. . GGCAGGUG - UGA- 

--AGGGUUUGA. . . . AAU - . CCUAGcugcUCAGCCAUAGAAACAUGGCUuccu uc . . . . gagggugCUAGGA . .C 

--AAGGCUUGA. . . .CAUGg-GGUG. . .aAAACUCGUAGAGAUACGGGGuccgc aagggCGCC-U. .CAC 

--AAGGCUUGA. . . .CAUA.CACGA. . . gAACGGGCCAGAAAUGGUCAAcucuuu ggaccgUCGUGAac 

{{{{{{{, , , <<<.<<<<<. . . .--<<<<<< < > >>>>>> >>>>>-. .>>>, [[-[[[[-- 

Figure 2. Multiple sequence alignment of partial bacterial 16S rRNA sequences corresponding to the region between common V6 variable region
amplification primers (15). Uppercase columns correspond to modeled positions. Lowercase columns correspond to regions where hypervariability in
size and structure precludes assignment of homologous residues. These columns are normally 'masked out' before phylogenetic analysis. (A) Using the
new RDP 11 alignment model. This matches the alignment for this region obtained with full-length sequences. (B) Using the RDP 10 alignment
model. The alignment of the full-length sequences is almost identical in this V6 region between the two models, except that one G-U pair in RDP 11
appears as inserts in the RDP 10 alignment. Bases highlighted in green are canonical base pairs matching the conserved secondary structure.
From top to bottom, the GenBank accessions are AB006164, AB006178, AB021164, AB015577, AB003932 and AB004715.
Figure 3. Accumulation curves showing (A) taxon size and (B) intra-taxon distance. All aligned sequences in RDP release 11.1 in each of the three
RDP collections were clustered as described. The average distance between pairs of sequences in a taxon is shown in (B). The shape of the phylum
curves, and to a lesser extent class curves, for archaea and fungi, is likely influenced by the small number of taxa and the skewed representation of
sequences in these taxa.



RDP Browsers 

RDP Browsers provide interactive web interfaces to 
RDP's sequence collections. The Hierarchy Browser 
enables researchers to navigate, search and select se- 
quences from the RDP collections displayed in either the 
RDP or NCBI taxonomy. Data set options allow researchers
to examine subsets of sequence records based
on any combination of the following options: type or
non-type strain sequences, uncultured or isolated organisms,
partial or near-full-length sequences and suspect
quality or good quality sequences. A search feature
allows researchers to enter a word or words to be
matched against sequence annotation. Advanced search
features include Boolean logic, regular expressions and
the ability to limit the search to a specified annotation field. The
'display depth' can be modified to control the number of
ranks shown in the hierarchy. Researchers can add or
remove sequences from their SeqCart by selecting individual
sequences or entire taxa.

Two additional specialized browsers are available. The 
Genome Browser organizes rRNA sequences from 
genome projects and provides rRNA copy number and 
genome size plus link-outs to additional genome information
hosted at other sites. Publication View organizes
sequences by publication. Sequences for any individual
publication can be displayed and selected in the Hierarchy
Browser.

myRDP is an account-based workspace that allows
researchers to upload and store their pre-publication
sequences. The facility is meant for single sequences up
to groups of several hundred. These can be partial or
complete rRNA sequences assembled from genomes or
metagenomes, or sequences assembled from low-volume
sequencing of rRNA gene clone libraries. The RDP
Amplicon Sequence Pipeline (RDPipeline) described
below is better suited for new amplicon sequencing
technologies such as Illumina MiSeq. Sequences are
uploaded in sequence groups to myRDP, and these
groupings are maintained. Upon uploading, sequences
are automatically submitted for alignment and classification.
These myRDP sequences can then be analyzed in
combination with sequences from RDP's collections
using RDP's tool suite. A special social network feature
allows sequence groups to be shared with additional researchers
('research buddies') specified by sequence
owners. This feature is especially useful for remote
collaborations.

Sequence Match (SeqMatch)

This is one of the most often used online RDP tools. It is a 
re-implementation of the original RDP 
SIMILARITY_RANK (19,20). SeqMatch finds closest 
RDP sequences to a query based on the fraction of 
shared seven-base sequence fragments (words) between 
the query and reference sequences (S_ab score). 
SeqMatch works well on partial- and full-length sequences 
and is more accurate than BLAST (21) at identifying 
database sequences that are closely related to query 
rRNA sequences. 
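
The word-based score described above can be illustrated with a minimal sketch. The text states only that S_ab is the fraction of shared seven-base words; dividing by the smaller of the two unique word counts is an assumption made here, and the function names are hypothetical.

```python
def unique_words(seq, k=7):
    """Return the set of unique k-base words (k-mers) in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def s_ab(query, reference, k=7):
    """Fraction of shared unique words between query and reference.

    Assumption: the shared count is normalized by the smaller of the
    two unique word sets, so a partial query fully contained in a
    longer reference still scores 1.0.
    """
    q_words = unique_words(query, k)
    r_words = unique_words(reference, k)
    if not q_words or not r_words:
        return 0.0  # sequence shorter than one word
    return len(q_words & r_words) / min(len(q_words), len(r_words))
```

Because the score depends only on word content, no alignment is needed, which is one reason such a measure behaves well for partial sequences.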

The online SeqMatch is a k-nearest neighbor classifier
and displays each query sequence under the lowest 
common ancestor taxon consistent with the k top 
matches for that query. In Detail View, these top k 
matches are all presented in a taxonomic hierarchy 
display similar to the Hierarchy Browser. 
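
Placing a query under the lowest common ancestor of its k top matches amounts to finding the longest shared prefix of the matches' lineages. The sketch below assumes each lineage is given as a root-to-leaf list of taxon names; the function name and the example lineages are illustrative only.

```python
def lowest_common_ancestor(lineages):
    """Return the longest root-to-leaf prefix shared by all lineages."""
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so uneven rank depths are safe
    for names_at_rank in zip(*lineages):
        if all(name == names_at_rank[0] for name in names_at_rank):
            shared.append(names_at_rank[0])
        else:
            break
    return shared


# Illustrative lineages for the top k = 3 matches of one query
top_matches = [
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Bacillaceae", "Bacillus"],
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Staphylococcaceae", "Staphylococcus"],
    ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales", "Lactobacillaceae", "Lactobacillus"],
]
placement = lowest_common_ancestor(top_matches)  # query is displayed under class Bacilli
```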

The standalone SeqMatch is available from the RDP 
GitHub repository. It requires an input sequence file, a 
reference sequence file and optional S_ab cutoff and k 
value. The output file contains the following information 
for each of the top k matches to a query: query name, match
sequence ID, orientation, S_ab score and the number of 
unique common 7-mers. 

Classifier 

The RDP Classifier rapidly and accurately assigns sequences
to taxa, with a bootstrap value as an estimate of
confidence for each assignment (22). The RDP Classifier
has several advantages over most other methods of classifying
rRNA sequences, especially for large high-throughput
sequencing datasets: it is fast with a
minimal memory requirement, does not require alignment,
works well for partial sequences and can be easily
retrained with an alternative taxonomy or for different genes.
The online RDP Classifier is pre-trained for bacterial and
archaeal 16S and for fungal 28S rRNA gene sequences
(see 'Taxonomy' for more detail). The bacterial and
archaeal 16S training set has been updated seven times
since the first release to reflect changes in taxonomic
opinions. The online tool takes input query sequences
and a choice of training set. The results are shown in a
taxonomic hierarchy view displaying all the taxon nodes
with sequences assigned to them. Researchers can change
the 'confidence threshold' to choose a cutoff suitable for
the dataset. For partial sequences, using a lower confidence
cutoff has been shown to increase the classification
coverage at genus rank with sufficient accuracy (23). A
detailed view shows individual queries assigned to each
taxon.
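
The effect of the confidence threshold can be shown with a small sketch that takes one classification result and reports the lowest taxon whose bootstrap confidence meets the cutoff. The input layout (a domain-to-genus list of rank, taxon and confidence) and the example values are assumptions for illustration, not the Classifier's actual output format.

```python
def assign_lowest_confident(assignments, threshold=0.8):
    """Return the lowest (rank, taxon, confidence) meeting the threshold.

    `assignments` is ordered from domain down to genus. Returns None
    if even the first rank falls below the threshold.
    """
    best = None
    for rank, taxon, confidence in assignments:
        if confidence < threshold:
            break  # lower ranks cannot be reported once one rank fails
        best = (rank, taxon, confidence)
    return best


# Illustrative result for a single partial sequence
result = [
    ("domain", "Bacteria", 1.00),
    ("phylum", "Firmicutes", 0.97),
    ("class", "Bacilli", 0.93),
    ("order", "Bacillales", 0.85),
    ("family", "Bacillaceae", 0.72),
    ("genus", "Bacillus", 0.61),
]
strict = assign_lowest_confident(result, threshold=0.8)
relaxed = assign_lowest_confident(result, threshold=0.5)
```

With the 0.8 cutoff the example sequence is reported at order rank, whereas the relaxed 0.5 cutoff reports it at genus rank, which is the gain in genus-level coverage described above.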

The current version of the Classifier incorporates a 
number of enhancements not covered in the initial publication.
The bootstrap assignment strategy has been
changed to avoid an over-prediction problem when 
multiple genera are tied for highest score during bootstrap 
trials. The Classifier now allows multiple sample inputs. 
Expanded output options include detailed classification 
assignment for each sequence and an output file with 
one column for each sample containing assignment 
counts for each taxon. The latter is in a format appropri- 
ate for beta diversity analysis and sample ordination, 
and produces results similar to those obtained from 
operational taxonomic unit (OTU) clustering-based 
methods (24). 

The command-line Classifier (available from the RDP 
GitHub repository) provides extensive support for retrain- 
ing, allowing researchers to rapidly test the consistency of 
their training sets and to flag possible errors in their 
custom taxonomies. There is no requirement for a 
uniform number of taxonomic ranks and uncommon 
ranks are correctly handled. The classification speed is 
proportional to the number of genera, not the number 
of training sequences. This allows custom training sets 
with very large numbers of sequences. However, a larger
training set with less accurate assignments or taxonomic
irregularities will not necessarily work well; the testing
tool can help validate a new training set. These features
have allowed researchers to retrain the RDP Classifier on 
a broader range of sequences, including ones from envir- 
onmental clades (25), on honeybee gut specific 16S rRNA 
sequences (26) and on fungal LSU sequences (4). 

Library Compare 

RDP Library Compare (22) is used to investigate statis- 
tical differences between a pair of sample libraries. Instead 
of estimating overall difference between samples, 
LibCompare provides P values for determining statistical 
significance of abundance differences for individual taxa. 
This tool first uses the RDP Classifier to assign sequences 
to taxa. Depending on the abundance of sequences 
assigned to each taxon, one of two statistical tests is 
used to compute a P value to determine if a taxon is differentially
represented in the two libraries.

The standalone Library Compare is available as part of 
the RDP Classifier package. It produces an output file containing
the detailed assignment result for each query, and a
tab-delimited file containing comparison results sorted
by P value. Each line contains the P value, taxon rank,
taxon name and the number of assignments from each
sample.

Probe Match 

Probe Match (20) searches a sequence
dataset for matches to the entered oligonucleotide sequences
(primers). This tool implements a fast bit-vector
algorithm for approximate substring matching (27). The
online Probe Match takes primer sequences in standard
IUPAC code (allowing degenerate bases). There is an
option to check a pair of primers in tandem, effectively 
testing in silico PCR. Researchers choose which of the 
three RDP sequence collections to search: Bacteria, 
Archaea or Fungi. Researchers can also limit the search 
to only the sequences containing a specified region of the 
molecule. This does not limit the search to that region, but 
by removing partial sequences missing the expected target 
site, it gives a more accurate estimate of primer coverage. 
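
Matching a degenerate primer against a target can be illustrated with a simple sliding-window scan. Note that this sketch is not the bit-vector algorithm used by Probe Match: it counts substitutions only (no insertions or deletions) and is far slower, but it shows how IUPAC codes and a mismatch allowance interact. The function names are hypothetical.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatches(primer, window):
    """Count positions where the target base is not allowed by the primer code."""
    return sum(1 for code, base in zip(primer, window) if base not in IUPAC[code])


def find_primer_sites(primer, seq, max_mismatches=0):
    """Return (start, mismatch_count) for every window within the allowance."""
    primer = primer.upper()
    seq = seq.upper().replace("U", "T")
    hits = []
    for start in range(len(seq) - len(primer) + 1):
        d = mismatches(primer, seq[start:start + len(primer)])
        if d <= max_mismatches:
            hits.append((start, d))
    return hits
```

Primer coverage for a collection would then be the fraction of sequences spanning the target region that return at least one hit, which is why excluding partial sequences that lack the site gives a more accurate estimate.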

The standalone Probe Match general-purpose search 
engine (available from the RDP GitHub repository) 
requires an input sequence file and one or more primer 
sequences. Input sequences can be from any genes and of 
any length, but primers must not be longer than 64 char- 
acters. Multiple primers can be used at the same time, but 
for each sequence only the result for the best matching 
primer is reported. The output file contains sequence 
IDs that match at least one of the primer(s) within the 
specified distance and the detailed information of the 
match. 

RDP Aligner 

RDP offers two ways for researchers to align sequences.
Any bacterial or archaeal 16S gene sequences uploaded to
a researcher's myRDP account are automatically aligned.
Researchers can also align bacterial and archaeal 16S as
well as fungal 28S sequences using the Aligner on the
RDPipeline website. The online Aligner uses the same
Infernal alignment models used to process the RDP
database sequences (see 'Alignments' section above). The
Aligner has been updated to Infernal version 1.1. This
version is 7.5 times faster than the previous version used
with RDP release 10. Since the standalone Infernal does
not check the orientation of sequences, the online RDP
Aligner first checks the orientation of each sequence and
reverse-complements it if necessary before aligning. The
RDPipeline contains a suite of tools for further processing
aligned sequence sets (see below). The Infernal 1.1 models
used in RDP release 11, as well as the Infernal 0.81 models
used in RDP release 10, are available from the RDP
GitHub repository.
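
The text does not say how the Aligner decides orientation. One simple approach, assumed here purely for illustration, is to compare the number of seven-base words a read shares with a forward-oriented reference before and after reverse-complementing it. All names below are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq):
    """Return the reverse complement (U is treated like T)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k=7):
    """Unique k-base words of a sequence, normalized to uppercase DNA."""
    seq = seq.upper().replace("U", "T")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient(seq, reference, k=7):
    """Return seq in the orientation sharing more words with the reference.

    Assumption: the reference is known to be in the forward orientation.
    Ties keep the original orientation.
    """
    ref_words = kmers(reference, k)
    forward = len(kmers(seq, k) & ref_words)
    flipped = reverse_complement(seq)
    reverse = len(kmers(flipped, k) & ref_words)
    return flipped if reverse > forward else seq
```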

Tree Builder 

Tree Builder constructs a phylogenetic tree from sequences
selected from the RDP collections and myRDP
in any combination. This tool uses the Weighbor (28)
weighted neighbor joining method with Jukes-Cantor corrected
distances calculated from the RDP alignment.
Bootstrap confidence estimates for each tree node are
calculated from 100 bootstrap resamplings of the
alignment columns. Normally an outgroup sequence
should be included to allow the tree to be correctly
rooted. The resulting tree is displayed in a Java applet
that allows interactive exploratory manipulations, such
as selecting nodes and swapping branches. The trees can
be downloaded in standard Newick format as well as in
PS/PDF file formats.
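
The Jukes-Cantor correction converts the observed proportion p of differing positions into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The sketch below applies it to two aligned sequences; skipping columns where either sequence has a gap or ambiguity code is an assumption made here, not a documented detail of Tree Builder.

```python
import math


def jukes_cantor_distance(a, b):
    """Jukes-Cantor corrected distance between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        # Assumption: only columns with an unambiguous base in both
        # sequences are compared; gaps and ambiguity codes are skipped.
        if x in "ACGTU" and y in "ACGTU":
            compared += 1
            if x != y:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    p = differing / compared
    if p >= 0.75:
        return math.inf  # correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)
```

For small p the corrected distance is close to p itself (p = 0.1 gives about 0.107); the correction grows as multiple substitutions at the same site become more likely.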

Assignment Generator 

The Assignment Generator provides support for a 16S 
rRNA gene analysis lesson plan (29). It introduces com- 
parative 16S rRNA gene analysis through a realistic bio- 
informatics exercise (unique sequences, common research 
tasks) that is easy to manage, distribute and evaluate using 
tools on RDP's website. An instructor can generate a set 
of unique assignments for the entire class by specifying the 
number of students in the class, the number of sequences 
for each student and the choice of dataset (among 
bacteria, bacteria and archaea, or medically important 
bacteria). The tool provides: (i) a unique set of sequences 
for each student that are derived from the RDP sequence 
collections in a way that preserves secondary structure, (ii) 
a set of directions describing assignments for the students 
and (iii) an evaluation key for the instructor with the 
expected results for each student. This tool has been 
used in classes of more than 500 students.

RDPipeline for high-throughput amplicon analysis 

The RDPipeline performs several common processing 
steps in taxonomy-dependent (using the RDP Classifier), 
and taxonomy-independent (using hierarchical clustering) 
analysis of large datasets. The RDPipeline is a new tool 
suite designed to replace our previous Pyrosequencing 
Pipeline (30), offering extended processing and analysis 
tools reflecting recent shifts in amplicon sequencing 
technologies and techniques. Researchers can utilize the 
tools in the RDPipeline tool suite in one of two ways. 
For researchers processing a moderate number of sequences,
we offer online versions of the RDPipeline
tools. For researchers involved in high-volume sequencing
projects, or who would like to incorporate some of our
tools into their local custom workflow, we offer all the
tools that make up the RDPipeline on the RDP GitHub
repository.

The online RDPipeline has integrated support for
myRDP accounts. All submitted jobs are viewable from
the 'my jobs' page. Analysis results are stored for up to
2 weeks but job history remains available. The job history
lists the type of each job, current status, submission, start
and completion times and supplied processing parameters.
For long running jobs, an email containing a direct link to
download the results is sent when processing has
completed. All RDPipeline tools accept compressed files.
Any compressed file will be expanded upon upload and 
the contained files are treated as the input to the tool. All 
RDPipeline tools have extended input validation checks 
on uploaded files. Before processing begins, a summary is 
displayed showing the files detected and any files not used 
because of unexpected file type. 
'Initial processing' prepares raw sequences from a
sequencing facility for analysis. It is a multi-step process
that includes sorting the raw reads by sample tag,
trimming off tag and primer regions and removing sequences
of low quality. The input can be a single file or a
compressed file containing multiple sequence files. For
paired-end data, it uses our Assembler (described below;
Figure 4) to assemble overlapping paired reads as the first
step. We recommend that researchers analyzing paired-end
data use a read Q score cutoff of around 25-27 to filter
out low-quality assembled reads. Summary statistics are
included in the download for each tag, including the
number of sequences removed by each filter and a histogram
of sequence lengths after filtering.
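
A read-level Q score filter can be sketched as follows. Two assumptions are made that the text does not confirm: the read Q score is taken to be the arithmetic mean of the per-base Phred scores, and the quality string uses the Phred+33 encoding common to current FASTQ files. The function names are hypothetical.

```python
def mean_q_score(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores)


def filter_by_q(records, min_q=25):
    """Keep (name, sequence, quality) records whose mean Q meets the cutoff."""
    return [rec for rec in records if mean_q_score(rec[2]) >= min_q]


# Illustrative assembled reads: 'I' encodes Q40 and '5' encodes Q20
reads = [
    ("read1", "ACGT", "IIII"),
    ("read2", "ACGT", "5555"),
    ("read3", "ACGT", "II55"),
]
kept = filter_by_q(reads, min_q=25)
```

In this example read2 (mean Q20) is removed while read1 (Q40) and read3 (Q30) are kept.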

Alignment 

This tool allows researchers to align up to 1 000 000
bacterial/archaeal 16S or fungal 28S sequences at a
time with Infernal 1.1 using the RDP alignment models.
Uploaded sequences for all genes are checked for orientation
and reverse complemented when necessary. Each
alignment job result also includes alignment position and
length statistics as well as a summary histogram of read
alignment start and end positions relative to the alignment
model.

Clustering 

The complete linkage clustering tool (29,30) allows users 
to upload ahgned sequences to be clustered as the first step 
in taxonomy-independent analysis. Sequence files can be 
clustered together with each file treated as a sample, or 
files can be clustered separately. The online clustering tool 
is limited to 150000 unique sequences per job. For clus- 
tering very large datasets, we provide a modified version 
of mcClust (31) for download (see below). This new 
version distributes distance calculations among a 
compute cluster and incorporates algorithmic changes 
that lower the time complexity and speed up clustering. 
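
For readers unfamiliar with complete linkage, the idea can be shown with a deliberately naive, in-memory sketch. This is not the RDP or mcClust implementation: it recomputes cluster distances on every pass and is only practical for tiny inputs. The function name and the toy distance matrix are illustrative.

```python
def complete_linkage(dist, cutoff):
    """Naive complete-linkage clustering at a single distance cutoff.

    dist: symmetric matrix (list of lists) of pairwise distances.
    Returns a list of clusters, each a sorted list of sequence indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Complete linkage: cluster distance is the largest member-to-member distance.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > cutoff:
            break  # no remaining pair of clusters is close enough to merge
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]  # y > x, so index x is unaffected
    return clusters


dist = [
    [0.00, 0.01, 0.02, 0.30],
    [0.01, 0.00, 0.02, 0.31],
    [0.02, 0.02, 0.00, 0.29],
    [0.30, 0.31, 0.29, 0.00],
]
print(complete_linkage(dist, cutoff=0.03))
```

At a cutoff of 0.03 the first three sequences form one cluster and the fourth remains a singleton.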

Ecological measures 

Researchers can use the cluster file obtained from Clustering 
or mcClust to compute five common ecological measures for 
their samples. Alpha diversity can be estimated using 
Shannon or Chao1 and beta diversity can be measured 
using the Jaccard or Sorensen indices. Researchers can 
also assess sequencing depth using the rarefaction tool. 
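
The four indices named above have standard textbook definitions, sketched below on per-OTU sequence counts. The source does not state which variants the RDP tools compute, so this is an assumption: the sketch uses the bias-corrected form of Chao1 and the presence/absence forms of Jaccard and Sorensen. The sample vectors are made up.

```python
import math


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over OTUs with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


def _present(counts):
    """Set of OTU indices with at least one sequence."""
    return {i for i, c in enumerate(counts) if c > 0}


def jaccard(a, b):
    """Jaccard index: shared OTUs / OTUs present in either sample."""
    sa, sb = _present(a), _present(b)
    return len(sa & sb) / len(sa | sb)


def sorensen(a, b):
    """Sorensen index: 2 * shared OTUs / (OTUs in a + OTUs in b)."""
    sa, sb = _present(a), _present(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb))


# Counts per OTU; both samples share the same OTU ordering.
sample_a = [10, 5, 1, 1, 2, 0]
sample_b = [8, 0, 3, 1, 0, 4]
print(shannon(sample_a), chao1(sample_a))
print(jaccard(sample_a, sample_b), sorensen(sample_a, sample_b))
```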

Sequencer run quality checks 

The RDPipeline includes two tools, Defined Community 
Analysis and Chimera Check, for assessing the quality of 
sequence runs (31). The latter is powered by UCHIME 
(9). For researchers who include a defined community 
sample in their sequencer run, the Defined Community 
Analysis Tool calculates the observed error rates based 
on the known gene sequences of the organisms in the 
defined community. 
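
One simple way to express an observed error rate is the fraction of aligned positions at which a read differs from its known reference. The sketch below assumes reads have already been pairwise aligned to the matching reference (gaps written as '-') and counts substitutions, insertions and deletions equally; the exact accounting used by the Defined Community Analysis Tool is not given here, so treat this as an illustration only.

```python
def per_base_error_rate(aligned_pairs):
    """Fraction of aligned columns where read and reference differ.

    aligned_pairs: iterable of (read, reference) strings of equal length,
    with '-' marking gaps. Columns that are gaps in both are skipped.
    """
    errors = 0
    columns = 0
    for read, ref in aligned_pairs:
        for r, t in zip(read, ref):
            if r == "-" and t == "-":
                continue
            columns += 1
            if r != t:
                errors += 1
    return errors / columns


pairs = [
    ("ACGT-A", "ACGTTA"),  # one deletion in the read
    ("ACGTTA", "ACCTTA"),  # one substitution
]
print(per_base_error_rate(pairs))
```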

Additional tools 

Researchers can use the 'Cluster File Format Conversion' 
tool to convert RDP Cluster files to an OTU table format, 
suitable for input to R and EstimateS, or to the BIOM 
format (32). The 'Alignment Merger' tool allows 
researchers to merge alignment files created independently. 
The 'Sequence Selector' tool allows researchers to upload 
a set of sequence files and a separate file containing a list 
of IDs. A file is returned either containing only the 
sequences specified, or excluding them, depending on the 
option selected. The 'Representative Sequence Selector' 
tool allows researchers to upload a cluster file and 
retrieve a 'representative sequence' from each cluster, 
defined as the sequence with the least sum of squared 
distances to all other sequences in the cluster. 
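
The definition of a representative sequence given above translates directly into code. In this minimal sketch the distance matrix and cluster membership are toy values, and the function name is hypothetical.

```python
def representative(cluster, dist):
    """Return the cluster member with the least sum of squared distances to the others."""
    return min(
        cluster,
        key=lambda i: sum(dist[i][j] ** 2 for j in cluster if j != i),
    )


dist = [
    [0.00, 0.02, 0.03],
    [0.02, 0.00, 0.01],
    [0.03, 0.01, 0.00],
]
print(representative([0, 1, 2], dist))
```

Here sequence 1 is selected because it lies closest, in the squared-distance sense, to the other two members.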

mcClust enhancements 

Hierarchical sequence clustering methods that worked 
well for thousands of amplicon sequences often fail with 
the increased output of the latest sequencing technologies. 
Exact clustering methods require all pairwise distances for 
the input sequences and thus scale on the order of O(n²). 
Many clustering implementations, in addition to requiring 
O(n²) computational time, also have a memory complexity 
of O(n²) as they store all distances in memory. 
Nonetheless, clustering methods remain an important 
tool in rRNA sequence analyses, and several groups 
have attempted to solve the scaling issues facing sequence 
clustering. One approach is to adopt an approximate 
clustering method, as in USEARCH (33) and CD-HIT (34), 
which use heuristics to limit the number of pairwise 
comparisons calculated. Another approach proposed by 
Loewenstein et al. (35) focuses on limiting the memory 
complexity of average linkage clustering by storing the 
distances on disk. To utilize disk storage for the pairwise 
distances, they must be in sorted order. With a general 
purpose sorting algorithm this increases the time 
complexity to O(n² log n²). 

Figure 4. Comparing per base error rates for three paired-end read assembly tools. The error rates were calculated using assembled reads filtered by 
either read Q score (Assembler and original PANDAseq; 38) or delta Q score (mothur; 39). Recommended read Q score of 27 for Assembler and 
base Q score (deltaq) of 6 for mothur are marked. (A) Sample M_20130714 and (B) Sample M_20130819. 

Several previously published complete linkage algorithm 
implementations take advantage of on-disk 
storage of distances to limit the memory requirements 
for clustering (31,36,37). These implementations still 
require all pairwise distances (or at least all pairwise 
distances up to a maximum distance cutoff), and more 
importantly require sorting of all these distances. We 
propose a distance calculation tool designed to compute 
distances efficiently, to allow the distance matrix 
computation to be parallelized and to use an alternative 
sorting method that reduces the time complexity back 
to O(n²) (see Supplementary material). 
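
The sorting method actually used is described only in the Supplementary material. As a generic illustration of how ordering distances can avoid the logarithmic factor of a comparison sort, the sketch below uses a bucket sort over quantized distances, which runs in time linear in the number of pairs plus the number of buckets. This is a stand-in technique chosen for illustration, not necessarily the authors' method; the function name, the `resolution` parameter and the toy input are assumptions.

```python
def bucket_sort_distances(pairs, cutoff, resolution=0.001):
    """Order (i, j, distance) triples by distance without a comparison sort.

    Distances are quantized to `resolution`, and pairs above `cutoff` are
    dropped. Pairs falling in the same bucket keep their input order, so the
    result is sorted by quantized distance rather than exact distance.
    """
    n_buckets = int(round(cutoff / resolution)) + 1
    buckets = [[] for _ in range(n_buckets)]
    for i, j, d in pairs:
        if d <= cutoff:
            buckets[int(round(d / resolution))].append((i, j, d))
    # Concatenate buckets from the smallest to the largest distance.
    return [p for bucket in buckets for p in bucket]


pairs = [(0, 1, 0.030), (0, 2, 0.010), (1, 2, 0.020), (0, 3, 0.400)]
ordered = bucket_sort_distances(pairs, cutoff=0.05)
print(ordered)
```

The pair at distance 0.400 exceeds the cutoff and is discarded, mirroring the maximum distance cutoff mentioned above.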

The distance calculation tool is implemented in Java 1.6 
and has been integrated as a tool into the mcClust package 
(31) available on the RDP GitHub repository. 

Assembler for paired-end reads 

Compared to single-stranded Illumina reads, assembled 
paired-end reads can provide longer sequences with 
lower error rates. However, newly developed paired-end 
assembly tools have limitations. We have extended the 
existing PANDAseq (38) paired-end read assembly 
program. Our modified PANDAseq (Assembler) 
performs a modified statistical analysis using the 
sequencer-supplied quality (Q) scores to find the most likely 
overlap, computes assembled Q scores for the read 
overlap region and handles more complex overlap 
layouts (see Supplementary material for details). 

We have tested Assembler using two defined community 
samples from two different MiSeq runs. Both runs passed 
the Illumina MiSeq quality standards but the basic per 
base error rates of these two samples are quite different 
(0.17% for sample M_20130714 and 0.7% for sample 
M_20130819 after assembly). Both are within the 
reported error rate range for paired-end MiSeq amplicon 
data (0.28-1.08%) (39). Using an overall read Q score 
quality filter to remove low-quality sequences, we tested 
the Assembler against the paired-end assembler and 
quality filter built into mothur (39), another amplicon 
analysis program. Assembler slightly outperformed 
mothur on the high-quality dataset (Figure 4A), and 
significantly outperformed it on the average-quality dataset 
(Figure 4B). In both datasets, Assembler outperformed 
the original PANDAseq when scored in a similar 
manner (although such Q score based filtering was not a 
goal of that implementation). Using a read Q score of 27 
decreases the error rates to 0.05% and 0.16% for 
M_20130714 and M_20130819, respectively, and was 
effective in selectively removing reads with a high 
number of errors (Supplementary Figure S3). The 
Assembler is integrated into Initial Processing and is 
available on the RDP GitHub repository. 

All three programs can be run with multiple threads but 
were limited to a single thread in our testing. On an AMD 
Opteron 8384 quad-core 2.7 GHz processor, it took 
Assembler 1.4 h to assemble over 16 million reads from 
one MiSeq run. The original PANDAseq took 20 min, 
while mothur took 21.3 h to assemble the same set of 
data using its recommended analysis protocol, on the 
same system. 

USER SUPPORT 

RDP's mission includes user support. Each RDP online tool 
is supplied with a help page as a quick reference for 
its functionality, algorithm and how-tos. An RDP Wiki 
provides an updated searchable repository for answers to 
commonly asked questions compiled from previous user 
communications with RDP staff. Workflow tutorials 
guide researchers through common task-oriented 
processes, provide sample data and introduce researchers 
to the best practices for NGS data analysis. For 
command-line tools, step-by-step instructions and 
sample data files are provided on the RDP GitHub 
repository. Support questions can be emailed to 
rdpstaff@msu.edu. Telephone support is available (+1 
517 432 4998). 

AVAILABILITY OF SUPPORTING DATA 

The sequence data from this study have been submitted to 
the ENA Short Read Archive (http://www.ebi.ac.uk/ena/) 
under accession no. PRJEB4878. 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 